(57) ABSTRACT

An editing tool is provided for developing word- 
pronunciation pairs based on a spelled word input. The 
editing tool includes a transcription generator that receives 
the spelled word input from the user and generates a list of 
suggested phonetic transcriptions. The editor displays the 
list of suggested phonetic transcriptions to the user and 
provides a mechanism for selecting the desired pronuncia- 
tion from the list of suggested phonetic transcriptions. The 
editing tool further includes a speech recognizer to aid the 
user in selecting the desired pronunciation from the list of 
suggested phonetic transcriptions based on speech data input 
that corresponds to the spelled word input, and a syllable 
editor that enables the user to manipulate a syllabic part of 
a selected pronunciation. Lastly, the desired pronunciation 
can be tested at any point through the use of a text-to-speech 
synthesizer that generates audible speech data for the 
selected phonetic transcription.

SYSTEM FOR DEVELOPING WORD-PRONUNCIATION PAIRS

BACKGROUND AND SUMMARY OF THE INVENTION

The present invention relates generally to speech recognition
and speech synthesis systems. More particularly, the invention
relates to developing word-pronunciation pairs.

Computer-implemented and automated speech technology today
involves a confluence of many areas of expertise, ranging from
linguistics and [...] signal processing and computer science.
The traditionally separate problems of text-to-speech (TTS)
synthesis and automatic speech recognition (ASR) actually
present many opportunities to share technology. Traditionally,
however, speech recognition and speech synthesis have been
addressed as entirely separate disciplines, relying very little
on the benefit that cross-pollination could have on both
disciplines.

We have discovered techniques, described in this document, for
combining speech recognition and speech synthesis technologies
to the mutual advantage of both disciplines in generating
pronunciation dictionaries. Having a good pronunciation
dictionary is key to both text-to-speech and automatic speech
recognition applications. In the case of text-to-speech, the
dictionary serves as the source of pronunciation for words
entered by graphemic or spelled input. In automatic speech
recognition applications, the dictionary serves as the lexicon
of words that are known by the system. When training the speech
recognition system, this lexicon identifies how each word is
phonetically spelled, so that the speech models may be properly
trained for each of the words.

In both speech synthesis and speech recognition applications,
the quality and performance of the application may be highly
dependent on the accuracy of the pronunciation dictionary.
Typically, it is expensive and time consuming to develop a good
pronunciation dictionary, because the only way to obtain
accurate data has heretofore been through use of professional
linguists, preferably a single one to guarantee consistency.
The linguist painstakingly steps through each word and provides
its phonetic transcription.

Phonetic pronunciation dictionaries are available for most of
the major languages, although these dictionaries typically have
a limited word coverage and do not adequately handle proper
names, unusual and compound nouns, or foreign words. Publicly
available dictionaries likewise fall short when used to obtain
pronunciations for a dialect different from the one for which
the system was trained or intended.

Currently available dictionaries also rarely match all of the
requirements of a given system. Some systems (such as
text-to-speech systems) need high accuracy; whereas other
systems (such as some automatic speech recognition systems) can
tolerate lower accuracy, but may require multiple valid
pronunciations for each word. In general, the diversity in
system requirements compounds the problem. Because there is no
"one size fits all" pronunciation dictionary, the construction
of good, application-specific dictionaries remains expensive.

The present invention provides a system and method for
developing word-pronunciation pairs for use in a pronunciation
dictionary. The invention provides a tool, which builds upon a
window environment to provide a user-friendly methodology for
defining, manipulating and storing the phonetic representation
of word-pronunciation pairs in a pronunciation dictionary.
Unlike other phonetic transcription tools, the invention
requires no specific linguistic or phonetic knowledge to
produce the pronunciation lexicon. It utilizes various
techniques to quickly provide the best phonetic representation
of a given word along with different means for "fine tuning"
this phonetic representation to achieve the desired
pronunciation. Immediate feedback to validate
word-pronunciation pairs is also provided by incorporating a
text-to-speech synthesizer. Applications will quickly become
apparent as developments expand in areas where exceptions to
the rules of pronunciation are common, [...] other specialized
[...].

For a more complete understanding of the invention, its objects
and advantages refer to the following specification and to the
accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 [...] method of the present invention;

FIG. 2 illustrates an editing tool useful in implementing a
system in accordance with the present invention;

FIG. 3 is a block diagram illustrating the presently preferred
phoneticizer using decision trees;

FIG. 4 is a tree diagram illustrating a letter-only tree used
in relation to the phoneticizer;

FIG. 5 is a tree diagram illustrating a mixed tree in
accordance with the present invention;

FIG. 6 is a block diagram illustrating a system for generating
decision trees in accordance with the present invention; and

FIG. 7 is a flowchart showing a method for generating training
data through an alignment process in accordance with the
present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

A word-pronunciation editor 10 for developing
word-pronunciation pairs is depicted in FIG. 1. The editor 10
uses spelled word input 12 to develop word-pronunciation pairs
that are in turn entered into a lexicon 14. The lexicon 14 of
the present invention is a word-pronunciation dictionary
comprised of ordered pairs of words and one or more associated
phonetic transcriptions. As will be more fully explained, the
lexicon 14 can be updated by adding word-pronunciation pairs or
by revising pronunciations of existing word-pronunciation
pairs.

A transcription generator 20 receives as input the spelled word
12. For illustration purposes it will be assumed that spelled
words 12 are entered via a keyboard, although spelled words may
be input through any convenient means, including by voice entry
or data file. The transcription generator 20 may be configured
in a variety of different ways depending on the system
requirements. In a first preferred embodiment of the present
invention, transcription generator 20 accesses a baseline
dictionary 22 or conventional letter-to-sound rules to produce
a suggested phonetic transcription 23.

In the description presented here, a distinction is made
between phonetic transcriptions and morpheme transcriptions.
The former represent words in terms of the phonemes in human
speech when the word is spoken, whereas the latter represent
atomic units (called morphs) from which larger words are made.
For instance, a compound word such as "catwalk" may be treated
morphemically as comprising the atomic units "cat" and "walk".
In an alternative embodiment, the transcription generator 20
may also include a morphemic component.
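
The lookup behavior of the transcription generator 20 can be sketched in a few lines. This is a minimal illustration only, not the patented implementation: the dictionary entries, the phoneme notation, the one-phoneme-per-letter rule table and the names `initial_transcription` and `update_lexicon` are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Lexicon 14 modeled as: spelled word -> one or more phonetic transcriptions.
Lexicon = Dict[str, List[str]]

# Hypothetical baseline dictionary 22; the phoneme strings are illustrative.
BASELINE_DICTIONARY: Lexicon = {
    "bible": ["b-ay b-ah-l"],
    "catwalk": ["k-ae-t w-ao-k"],
}

# Toy stand-in for conventional letter-to-sound rules (one phoneme per letter).
LETTER_TO_PHONEME = {"a": "ae", "b": "b", "c": "k", "i": "ih",
                     "l": "l", "s": "s", "t": "t"}


def letter_to_sound(word: str) -> str:
    """Map each letter independently; unknown letters pass through unchanged."""
    return "-".join(LETTER_TO_PHONEME.get(ch, ch) for ch in word.lower())


def initial_transcription(word: str, dictionary: Lexicon,
                          fallback: Callable[[str], str] = letter_to_sound) -> str:
    """Look the word up first; fall back to letter-to-sound rules if absent."""
    entries = dictionary.get(word.lower())
    return entries[0] if entries else fallback(word)


def update_lexicon(lexicon: Lexicon, word: str, transcription: str) -> None:
    """Add a new pair, or a further pronunciation for an existing word."""
    pronunciations = lexicon.setdefault(word.lower(), [])
    if transcription not in pronunciations:
        pronunciations.append(transcription)
```

Storing a list of transcriptions per word mirrors the statement that each lexicon entry pairs a word with one or more associated phonetic transcriptions.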
In operation, an initial phonetic transcription of the spelled
word 12 is derived through a lookup in the baseline dictionary
22. In the event no pronunciation is found for the spelled
word, conventional letter-to-sound rules are used to generate
an initial phonetic transcription. If the resulting
pronunciation is unsatisfactory to the user, a phoneticizer 24
may provide additional suggested pronunciations for the spelled
word 12. The phoneticizer 24 generates a list of suggested
phonetic transcriptions 26 based on the spelled word input
using a set of decision trees. Details of a suitable
phoneticizer are provided below.

Each transcription in the suggested list 26 has a numeric value
by which it can be compared with other transcriptions in the
suggested list 26. Typically, these numeric scores are the
byproduct of the transcription generation mechanism. For
example, when the decision tree-based phoneticizer 24 is used,
each phonetic transcription has associated with it a confidence
level score. This confidence level score represents the
cumulative score of the individual probabilities associated
with each phoneme. As the reader will see from the description
below, the leaf nodes of each decision tree in the phoneticizer
24 are populated with phonemes and their associated
probabilities. These probabilities are numerically represented
and can be used to generate a confidence level score. Although
these confidence level scores are generally not displayed to
the user, they are used to order the displayed list of n-best
suggested transcriptions 26 as provided by the phoneticizer 24.

A user selection mechanism 28 allows the user to select a
pronunciation from the list of suggested transcriptions 26 that
matches the desired pronunciation.

An automatic speech recognizer 30 is incorporated into the
editor 10 for aiding the user in quickly selecting the desired
pronunciation from the list of suggested transcriptions 26. By
using the confidence level score associated with each of the
suggested transcriptions, the speech recognizer 30 may be used
to reorder the list of suggested transcriptions 26. The speech
recognizer 30 extracts phonetic information from a speech input
signal 32, which corresponds to the spelled word input 12.
Suitable sources of speech include: live human speech, audio
recordings, speech databases, and speech synthesizers. The
speech recognizer 30 then uses the speech signal 32 to reorder
the list of suggested transcriptions 26, such that the
transcription which most closely corresponds to the speech
input signal 32 is placed at the top of the list of suggested
transcriptions 26.

As shown in FIG. 2, a graphical user interface 40 is the tool
by which a user selects and manipulates the phonetic
transcriptions provided by the transcription generator 20 and
the phoneticizer 24. Initially, the spelled word input 12 is
placed into a spelling field 42. If a phonetic transcription of
the word 12 is provided by the baseline dictionary 22, then its
corresponding phonetic representation defaults into the
phonemes field 48; otherwise, conventional letter-to-sound
rules are used to populate the phonemes field 48. The phonemic
transcription displayed in the phonemes field 48 is hyphenated
to demark the syllables which make up the word. In this way, a
user can directly edit the individual syllables of the phoneme
transcription in the phonemes field 48.

Alternatively, the spelled word input 12 may be selected from a
word list 44 as provided by a word source file (e.g., a
dictionary source). Highlighting any word in the word list 44
places that word in the spelling field 42 and its corresponding
phonetic transcription in the phonemes field 48. As previously
discussed, a list of n-best suggested phonetic transcriptions
26 is generated by phoneticizer 24 based upon the spelled word
input 12. If the pronunciation in the phonemes field 48 is
unsatisfactory, then the user preferably selects one of these
phonetic transcriptions (that closely matches the desired
pronunciation) to populate the phonemes field 48. It is also
envisioned that desired word input may be spoken by the user.
This speech input is converted into a spelled word by the
speech recognizer 30 and then translated into a phonetic
transcription as described above.

At any time, the user can specify in the language selection box
46 an operative language for the word-pronunciation editor 10.
[...] automatically [...] a mode that corresponds to the
selected language. For instance, the transcription generator 20
may access a dictionary [...] the selected language [...]
transcription for the word [...]. To function, the phoneticizer
[...] 30 and the text-to-speech synthesizer 36 may also need to
access [...] files and/or [...]. It is also envisioned that the
user [...] may also alter [...] of the [...]. In this way, the
[...] 10 [...] development of word-pronunciation pairs in the
user's native language.

Regardless of the language selection, the word-pronunciation
editor 10 provides various means for manipulating syllabic
portions of the phonetic transcription displayed in the
phonemes field 48. A phonemic editor 34 (as shown in FIG. 1)
provides the user a number of options for modifying an
individual syllable of the phonetic transcription. For
instance, stress (or emphasis) buttons 50 line up underneath
the syllables in phonemes field 48. In this way, the user can
select these buttons 50 to alter the stress applied to the
syllable, thereby modifying the pronunciation of the word. Most
often mispronunciation is a factor of the wrong vowel being
used in a syllable. The user can also use the vowel step
through button 52 and/or the vowel table list 54 to select
different vowels to substitute for those appearing in the
selected syllable of the phonemes field 48.

In one embodiment of the phonemic editor 34, the user speaks an
individual syllable into a microphone (not shown) and the
original text spelling that corresponds to its pronunciation is
provided in the sounds like field 56. When the user has
selected a particular syllable of the phonetic transcription in
the phonemes field 48, then a corresponding phonemic
representation of the speech input also replaces this selected
syllable in the phonetic transcription. It should be noted that
the speech input corresponding to an individual syllable is
first translated into the corresponding text spelling by the
speech recognizer 30. The phonemic editor 34 then converts this
text spelling into the corresponding phonemic representation.
In this way, one or more selected syllabic portions of the
pronunciation may be replaced with a word known to sound
similar to the desired pronunciation. Alternatively, the
phonemic editor 34 presents the user with a menu of words based
on the spoken vowel sounds and the user selects the word that
corresponds to the desired vowel pronunciation of the syllable.
If during the editing process the user becomes dissatisfied
with the pronunciation displayed in the phonemes field 48, then
the phonetic transcription can be reset to its original state
by selecting the reset button 56.

By clicking on a speaker icon 58, the user may also test the
current pronunciation displayed in the phonemes field 48.
Returning to FIG. 1, a text-to-speech synthesizer 36 generates
audible speech data 37 from the current pronunciation found in
the phonemes field 48. Generating audible speech data from a
phonetic transcription is well known to one skilled in the art.
Once the user has completed editing the phonetic transcription,
a storage mechanism 38 can be initiated (via the save button
60) to update the desired word-pronunciation pair in lexicon
14.
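
The ordering of the suggested list 26 by confidence score, and its reordering once a speech input is available, might look roughly like the following sketch. The `Suggestion` class, the function names, the scores and the use of `difflib.SequenceMatcher` as a stand-in for an acoustic match score are all assumptions for illustration; a real recognizer would score transcriptions against the audio signal itself.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class Suggestion:
    transcription: str   # phonemes joined by "-" (illustrative notation)
    confidence: float    # cumulative score produced by the phoneticizer


def order_by_confidence(suggestions: List[Suggestion]) -> List[Suggestion]:
    """Default display order: highest confidence first."""
    return sorted(suggestions, key=lambda s: s.confidence, reverse=True)


def reorder_by_speech(suggestions: List[Suggestion],
                      match: Callable[[str], float]) -> List[Suggestion]:
    """Put the transcription closest to the speech input on top.

    Ties in the match score are broken by the confidence score.
    """
    return sorted(suggestions,
                  key=lambda s: (match(s.transcription), s.confidence),
                  reverse=True)


def phoneme_similarity(recognized: str) -> Callable[[str], float]:
    """Hypothetical match score: similarity to a recognized phoneme string."""
    target = recognized.split("-")

    def match(transcription: str) -> float:
        return SequenceMatcher(None, target, transcription.split("-")).ratio()

    return match
```

For example, if the recognizer hears something close to "ah-k-ih-l-iy-z", that transcription moves ahead of a higher-confidence candidate that pronounces both l's.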



Phoneticizer

An exemplary embodiment of phoneticizer 24 is shown in FIG. 3
to illustrate the principles of generating multiple
pronunciations based on the spelled form of a word. Heretofore,
most attempts at spelled word-to-pronunciation transcription
techniques have relied solely upon the letters themselves. For
some languages, letter-only pronunciation generators yield
satisfactory results; for others (particularly English), the
results may be unsatisfactory. For example, a letter-only
pronunciation generator would have great difficulty properly
pronouncing the word bible. Based on the sequence of letters
only, the letter-only system would likely pronounce the word
"BIB-L", much as a grade school child learning to read might
do. The fault in conventional systems lies in the inherent
ambiguity imposed by the pronunciation rules of many languages.
The English language, for example, has hundreds of different
pronunciation rules, making it difficult and computationally
expensive to approach the problem on a word-by-word basis.

Therefore, the presently preferred phoneticizer 24 is a
pronunciation generator employing two stages, the first stage
employing a set of letter-only decision trees 72 and the
second, optional stage, employing a set of mixed-decision trees
74. Depending on the language and the application, we may
implement only the first stage (taking as output the
pronunciations shown at 80), or implement both stages and take
the pronunciations output at 84. An input sequence 76, such as
the sequence of letters B-I-B-L-E, is fed to a dynamic
programming phoneme sequence generator 78. The sequence
generator 78 uses the letter-only trees 72 to generate a list
of pronunciations 80, representing possible pronunciation
candidates of the spelled word input sequence.

The sequence generator 78 sequentially examines each letter in
the sequence, applying the decision tree associated with that
letter to select a phoneme pronunciation for that letter based
on probability data contained in the letter-only tree.
Preferably, the set of letter-only decision trees includes a
decision tree for each letter in the alphabet. FIG. 4 shows an
example of a letter-only decision tree for the letter E. The
decision tree comprises a plurality of internal nodes
(illustrated as ovals in the Figure), and a plurality of leaf
nodes (illustrated as rectangles in the Figure). Each internal
node is populated with a yes-no question. Yes-no questions are
questions that can be answered either yes or no. In the
letter-only tree these questions are directed to the given
letter (in this case the letter E), and its neighboring letters
in the input sequence. Note in FIG. 4 that each internal node
branches either left or right, depending on whether the answer
to the associated question is yes or no.

Abbreviations are used in FIG. 4 as follows: numbers in
questions, such as "+1" or "-1" refer to positions in the
spelling relative to the current letter. For example,
"+1L=='R'?" means "Is the letter after the current letter
(which, in this case, is the letter E) an R?" The abbreviations
CONS and VOW represent classes of letters, namely consonants
and vowels. The absence of a neighboring letter, or null
letter, is represented by the symbol -, which is used as a
filler or placeholder where aligning certain letters with
corresponding phoneme pronunciations. The symbol # denotes a
word boundary.

The leaf nodes are populated with probability data that
associate possible phoneme pronunciations with numeric values
representing the probability that the particular phoneme
represents the correct pronunciation of the given letter. For
example, the notation "iy->0.51" means "the probability of
phoneme 'iy' in this leaf is 0.51." The null phoneme, i.e.,
silence, is represented by the symbol '-'.

The sequence generator 78 (FIG. 3) thus uses the letter-only
decision trees 72 to construct one or more pronunciation
hypotheses that are stored in list 80. Preferably, each
pronunciation has associated with it a numerical score arrived
at by combining the probability scores of the individual
phonemes selected using the decision tree 72. Word
pronunciations may be scored by constructing a matrix of
possible combinations and then using dynamic programming to
select the n-best candidates. Alternatively, the n-best
candidates may be selected using a substitution technique that
first identifies the most probable transcription candidate and
then generates additional candidates through iterative
substitution as follows:

The pronunciation with the highest probability score is
selected first by multiplying the respective scores of the
highest-scoring phonemes (identified by examining the leaf
nodes), and then using this selection as the most probable
candidate or first-best word candidate. Additional (n-best)
candidates are then selected by examining the phoneme data in
the leaf nodes again to identify the phoneme, not previously
selected, that has the smallest difference from an initially
selected phoneme. This minimally-different phoneme is then
substituted for the initially selected one to thereby generate
the second-best word candidate. The above process may be
repeated iteratively until the desired number of n-best
candidates have been selected. List 80 may be sorted in
descending score order so that the pronunciation judged the
best by the letter-only analysis appears first in the list.

As noted above, a letter-only analysis will frequently produce
poor results. This is because the letter-only analysis has no
way of determining at each letter what phoneme will be
generated by subsequent letters. Thus, a letter-only analysis
can generate a high scoring pronunciation that actually would
not occur in natural speech. For example, the proper name,
Achilles, would likely result in a pronunciation that
phoneticizes both "l's": ah-k-ih-l-l-iy-z. In natural speech,
the second "l" is actually silent: ah-k-ih-l-iy-z. The sequence
generator using letter-only trees has no mechanism to screen
out word pronunciations that would never occur in natural
speech.

The second stage of the phoneticizer 24 addresses the above
problem. A mixed-tree score estimator 82 uses the set of
mixed-decision trees 74 to assess the viability of each
pronunciation in list 80. The score estimator works by
sequentially examining each letter in the input sequence along
with the phonemes assigned to each letter by sequence generator
78. Like the set of letter-only trees, the set of mixed trees
has a mixed tree for each letter of the alphabet. An exemplary
mixed tree is shown in FIG. 5. Like the letter-only tree, the
mixed tree has internal nodes and leaf nodes. The internal
nodes are illustrated as ovals and the leaf nodes as rectangles
in FIG. 5. The internal nodes are each populated with a yes-no
question and the leaf nodes are each populated with probability
data. Although the tree structure of the mixed tree resembles
that of the letter-only tree, there is one important
difference. The internal nodes of the mixed tree can contain
two different classes of questions. An internal node can
contain a question about a given letter and its neighboring
letters in the sequence, or it can contain a question about the
phoneme associated with that letter and neighboring phonemes
corresponding to that sequence. The decision tree is thus
mixed, in that it contains mixed classes of questions.
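
The first-stage substitution technique described above (take the highest-probability phoneme for each letter, then swap in the alternative whose probability differs least from the one initially selected) can be sketched as follows. This is one reading of the description, offered as an assumption: each substitution is applied to the first-best candidate rather than cumulatively, and the leaf distributions in the usage data are invented rather than taken from a trained tree. The name `n_best_by_substitution` is hypothetical.

```python
import math
from typing import Dict, List, Tuple

# One leaf distribution per letter: phoneme -> probability (illustrative values).
LeafDistribution = Dict[str, float]


def n_best_by_substitution(leaves: List[LeafDistribution],
                           n: int) -> List[Tuple[List[str], float]]:
    """Return up to n (phoneme sequence, score) pairs, best score first."""
    # First-best candidate: the highest-probability phoneme at every letter.
    best = [max(leaf, key=leaf.get) for leaf in leaves]

    def score(phonemes: List[str]) -> float:
        # A candidate's score is the product of its phoneme probabilities.
        return math.prod(leaf[p] for leaf, p in zip(leaves, phonemes))

    results = [(best, score(best))]

    # Every not-yet-selected phoneme, keyed by how far its probability
    # falls below the phoneme initially selected for that letter.
    alternatives = []
    for pos, leaf in enumerate(leaves):
        for phoneme, prob in leaf.items():
            if phoneme != best[pos]:
                alternatives.append((leaf[best[pos]] - prob, pos, phoneme))
    alternatives.sort()  # smallest difference first

    for _, pos, phoneme in alternatives[: max(n - 1, 0)]:
        candidate = best.copy()
        candidate[pos] = phoneme
        results.append((candidate, score(candidate)))

    # List 80 is kept in descending score order.
    results.sort(key=lambda item: item[1], reverse=True)
    return results


# Hypothetical leaf data for the letters B-I-B-L-E.
BIBLE_LEAVES: List[LeafDistribution] = [
    {"b": 1.0},
    {"ay": 0.6, "ih": 0.4},
    {"b": 1.0},
    {"l": 0.9, "-": 0.1},
    {"-": 0.7, "iy": 0.2, "eh": 0.1},
]
```

The dynamic programming alternative mentioned in the text would instead search the full matrix of combinations; the substitution approach trades completeness for simplicity.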

The abbreviations used in FIG. 5 are similar to those used 
in FIG. 4, with some additional abbreviations. The symbol 
L represents a question about a letter and its neighboring 
letters. The symbol P represents a question about a phoneme 
and its neighboring phonemes. For example, the question
"+1L=='D'?" means "Is the letter in the +1 position a 'D'?"
The abbreviations CONS and SYL are phoneme classes,
namely consonant and syllabic. For example, the question
"+1P==CONS?" means "Is the phoneme in the +1 position
a consonant?" The numbers in the leaf nodes give phoneme 
probabilities as they did in the letter-only trees. 
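
To make the question notation concrete, the sketch below evaluates questions such as "+1L=='D'?" and "+1P==CONS?" against an aligned letter and phoneme sequence. The parser, the class memberships (including treating Y as a consonant), the phoneme inventory and the choice to return the boundary symbol # for positions outside the word are all assumptions for illustration, and the function name `answer` is hypothetical.

```python
import re
from typing import List

# offset, sequence type (L = letter, P = phoneme), value to compare against
QUESTION = re.compile(r"^([+-]?\d+)([LP])==(.+?)\??$")

# Assumed class memberships; a real system would define these per language.
LETTER_CLASSES = {"VOW": set("AEIOU"), "CONS": set("BCDFGHJKLMNPQRSTVWXYZ")}
PHONEME_CLASSES = {"SYL": {"iy", "ih", "eh", "ay", "ah"},
                   "CONS": {"b", "d", "k", "l", "z"}}


def answer(question: str, letters: List[str], phonemes: List[str],
           position: int) -> bool:
    """Answer a yes-no question for the letter at `position`.

    `letters` and `phonemes` are aligned one-to-one; "-" marks a null phoneme.
    """
    parsed = QUESTION.match(question)
    if parsed is None:
        raise ValueError(f"unrecognized question: {question!r}")
    offset, kind, value = int(parsed.group(1)), parsed.group(2), parsed.group(3)
    sequence = letters if kind == "L" else phonemes
    classes = LETTER_CLASSES if kind == "L" else PHONEME_CLASSES
    index = position + offset
    # Positions outside the word are treated as the word boundary "#".
    symbol = sequence[index] if 0 <= index < len(sequence) else "#"
    if value.startswith("'") and value.endswith("'"):
        return symbol == value[1:-1]      # comparison with a single symbol
    return symbol in classes[value]       # comparison with a class such as CONS
```

A letter-only tree would only ever issue `L` questions, whereas a mixed tree may issue both kinds at any internal node.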

The mixed-tree score estimator rescores each of the
pronunciations in list 80 based on the mixed-tree questions
and using the probability data in the leaf nodes of the mixed
trees. If desired, the list of pronunciations may be stored in
association with the respective score as in list 84. If desired,
list 84 can be sorted in descending order so that the first
listed pronunciation is the one with the highest score.

In many instances, the pronunciation occupying the high- 
est score position in list 80 will be different from the 
pronunciation occupying the highest score position in list 
84. This occurs because the mixed-tree score estimator, 
using the mixed trees 74, screens out those pronunciations 
that do not contain self-consistent phoneme sequences or 
otherwise represent pronunciations that would not occur in 
natural speech. 

The system for generating the letter-only trees and the 
mixed trees is illustrated in FIG. 6. At the heart of the 
decision tree generation system is tree generator 120. The 
tree generator 120 employs a tree-growing algorithm that
operates upon a predetermined set of training data 122 
supplied by the developer of the system. Typically the 
training data 122 comprise aligned letter, phoneme pairs that 
correspond to known proper pronunciations of words. The 
training data 122 may be generated through the alignment 
process illustrated in FIG. 7. FIG. 7 illustrates an alignment 
process being performed on an exemplary word BIBLE. The 
spelled word 124 and its pronunciation 126 are fed to a 
dynamic programming alignment module 128 which aligns 
the letters of the spelled word with the phonemes of the 
corresponding pronunciation. Note in the illustrated 
example the final E is silent. The letter phoneme pairs are 
then stored as data 122. 
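
One way such a dynamic programming alignment could work is sketched below. The scoring model is an assumption, not the patented one: each letter either pairs with the next phoneme (scored by a hypothetical letter-phoneme affinity table) or is silent (a small fixed score), and the alignment with the highest total is kept. The phoneme string used in the test is a simplified stand-in for the pronunciation of BIBLE.

```python
from typing import Dict, List, Tuple

NULL = "-"  # null phoneme used for silent letters


def align(letters: str, phonemes: List[str],
          affinity: Dict[Tuple[str, str], float],
          null_score: float = 0.1) -> List[Tuple[str, str]]:
    """Align each letter with one phoneme or with the null phoneme."""
    n, m = len(letters), len(phonemes)
    neg = float("-inf")
    # best[i][j]: best score aligning the first i letters with the first j phonemes
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(m + 1):
            # Option 1: letter i is silent.
            if best[i - 1][j] > neg:
                best[i][j] = best[i - 1][j] + null_score
                back[i][j] = "null"
            # Option 2: letter i is pronounced as phoneme j.
            if j > 0 and best[i - 1][j - 1] > neg:
                pair = (letters[i - 1], phonemes[j - 1])
                total = best[i - 1][j - 1] + affinity.get(pair, 0.0)
                if total > best[i][j]:
                    best[i][j] = total
                    back[i][j] = "pair"
    if best[n][m] == neg:
        raise ValueError("more phonemes than letters; cannot align one-to-one")
    # Trace back from the final cell to recover the letter, phoneme pairs.
    pairs: List[Tuple[str, str]] = []
    i, j = n, m
    while i > 0:
        if back[i][j] == "pair":
            pairs.append((letters[i - 1], phonemes[j - 1]))
            j -= 1
        else:
            pairs.append((letters[i - 1], NULL))
        i -= 1
    pairs.reverse()
    return pairs


# Hypothetical affinity table: plausible letter-to-phoneme correspondences.
AFFINITY = {("B", "b"): 1.0, ("I", "ay"): 1.0, ("I", "ih"): 1.0,
            ("L", "l"): 1.0, ("E", "iy"): 1.0, ("E", "eh"): 1.0}
```

The resulting pairs have the same shape as the training data 122: one phoneme, possibly null, per letter.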

Returning to FIG. 6, the tree generator 120 works in
conjunction with three additional components: a set of
possible yes-no questions 130, a set of rules 132 for
selecting the best questions for each node or for deciding if
the node should be a leaf node, and a pruning method 133 to
prevent over-training.

The set of possible yes-no questions may include letter 
questions 134 and phoneme questions 136, depending on 
whether a letter-only tree or a mixed tree is being grown. 
When growing a letter-only tree, only letter questions 134 
are used; when growing a mixed tree both letter questions 
134 and phoneme questions 136 are used. 

The rules for selecting the best question to populate at
each node in the presently preferred embodiment are
designed to follow the Gini criterion. Other splitting criteria
can be used instead. For more information regarding splitting
criteria, see Breiman, Friedman et al., "Classification and
Regression Trees." Essentially, the Gini criterion is
used to select a question from the set of possible yes-no
questions 130 and to employ a stopping rule that decides
when a node is a leaf node. The Gini criterion employs a
concept called "impurity." Impurity is always a non-negative
number. It is applied to a node such that a node
containing equal proportions of all possible categories has
maximum impurity and a node containing only one of the
possible categories has a zero impurity (the minimum possible
value). There are several functions that satisfy the
above conditions. These depend upon the counts of each
category within a node. Gini impurity may be defined as
follows. If C is the set of classes to which data items can
belong, and T is the current tree node, let f(1|T) be the
proportion of training data items in node T that belong to
class 1, f(2|T) the proportion of items belonging to class 2,
etc. Then

$$i(T) = \sum_{j,k \in C,\; j \neq k} f(j|T)\, f(k|T) = 1 - \sum_{j \in C} [f(j|T)]^2$$
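
As a short check, assuming the usual Gini definition from Breiman et al., the pairwise-product form and the one-minus-sum-of-squares form of the impurity agree because the class proportions in a node sum to one:

$$\sum_{j \neq k} f(j|T)\, f(k|T) = \Big(\sum_{j \in C} f(j|T)\Big)^2 - \sum_{j \in C} [f(j|T)]^2 = 1 - \sum_{j \in C} [f(j|T)]^2$$

Both boundary conditions stated above follow directly: a node holding a single class has one proportion equal to 1, giving impurity 0, and for a fixed number of classes the sum of squares is smallest when all proportions are equal, giving the maximum impurity.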

To illustrate by example, assume the system is growing a
tree for the letter "E." In a given node T of that tree, the
system may, for example, have 10 examples of how "E" is
pronounced in words. In 5 of these examples, "E" is
pronounced "iy" (the sound "ee" in "cheese"); in 3 of the
examples "E" is pronounced "eh" (the sound of "e" in
"bed"); and in the remaining 2 examples, "E" is "-" (i.e.,
silent as in "e" in "maple").

Assume the system is considering two possible yes-no
questions, Q1 and Q2, that can be applied to the 10 examples.
The items that answer "yes" to Q1 include four examples of
"iy" and one example of "-" (the other five items answer
"no" to Q1). The items that answer "yes" to Q2 include three
examples of "iy" and three examples of "eh" (the other four
items answer "no" to Q2). FIG. 6 diagrammatically
compares these two cases.

The Gini criterion answers which question the system
should choose for this node, Q1 or Q2. The Gini criterion for
choosing the correct question is: find the question in which
the drop in impurity in going from parent nodes to children
nodes is maximized. This impurity drop ΔI is defined as

$$\Delta I = i(T) - p_{yes}\, i(yes) - p_{no}\, i(no),$$

where p_yes is the proportion of items going to the "yes"
child and p_no is the proportion of items going to the "no"
child.
Applying the Gini criterion to the above example:

$$i(T) = 1 - \sum_{j} [f(j|T)]^2 = 1 - 0.5^2 - 0.3^2 - 0.2^2 = 0.62$$

ΔI for Q1 is thus:

$$i(yes, Q_1) = 1 - 0.8^2 - 0.2^2 = 0.32$$

$$i(no, Q_1) = 1 - 0.2^2 - 0.6^2 - 0.2^2 = 0.56$$

So ΔI(Q1) = 0.62 - 0.5*0.32 - 0.5*0.56 = 0.18.

For Q2, we have i(yes, Q2) = 1 - 0.5^2 - 0.5^2 = 0.5, and for
i(no, Q2) = (same) = 0.5.

So, ΔI(Q2) = 0.62 - (0.6)*(0.5) - (0.4)*(0.5) = 0.12. In this
case, Q1 gave the greatest drop in impurity. It will therefore
be chosen instead of Q2.
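
The arithmetic of this example can be reproduced with a short script. The function names are assumptions; the label counts are exactly those given in the example (five "iy", three "eh" and two silent pronunciations, split by the two candidate questions).

```python
from collections import Counter
from typing import Dict, List, Tuple


def gini_impurity(labels: List[str]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def impurity_drop(parent: List[str], yes: List[str], no: List[str]) -> float:
    """Drop in impurity from the parent node to its yes and no children."""
    p_yes = len(yes) / len(parent)
    p_no = len(no) / len(parent)
    return gini_impurity(parent) - p_yes * gini_impurity(yes) - p_no * gini_impurity(no)


def best_question(parent: List[str],
                  splits: Dict[str, Tuple[List[str], List[str]]]) -> str:
    """Pick the question whose split gives the greatest impurity drop."""
    return max(splits, key=lambda q: impurity_drop(parent, *splits[q]))


# The node from the example: 10 pronunciations of the letter "E".
parent = ["iy"] * 5 + ["eh"] * 3 + ["-"] * 2
splits = {
    "Q1": (["iy"] * 4 + ["-"], ["iy"] + ["eh"] * 3 + ["-"]),
    "Q2": (["iy"] * 3 + ["eh"] * 3, ["iy"] * 2 + ["-"] * 2),
}
```

Calling `best_question(parent, splits)` selects Q1, in agreement with the drops of 0.18 and 0.12 computed above.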

The rule set 132 declares a best question for a node to be
that question which brings about the greatest drop in
impurity in going from the parent node to its children.

The tree generator applies the rules 132 to grow a decision
tree of yes-no questions selected from set 130. The generator
will continue to grow the tree until the optimal-sized tree has
been grown. Rules 132 include a set of stopping rules that
will terminate tree growth when the tree is grown to a
pre-determined size. In the preferred embodiment the tree is
grown to a size larger than ultimately desired. Then pruning 
methods 133 are used to cut back the tree to its desired size.
The pruning method may implement the Breiman technique
as described in the reference cited above.

The tree generator thus generates sets of letter-only trees,
shown generally at 140, or mixed trees, shown generally at
150, depending on whether the set of possible yes-no
questions 130 includes letter-only questions alone or in
combination with phoneme questions. The corpus of training
data 122 comprises letter, phoneme pairs, as discussed
above. In growing letter-only trees, only the letter portions
of these pairs are used in populating the internal nodes.
Conversely, when growing mixed trees, both the letter and
phoneme components of the training data pairs may be used
to populate internal nodes. In both instances the phoneme
portions of the pairs are used to populate the leaf nodes.
Probability data associated with the phoneme data in the
leaf nodes are generated by counting the number of occurrences
a given phoneme is aligned with a given letter over
the training data corpus.

In one embodiment of the present invention, the editor 10
is adaptive or self-learning. One or more spelled word-pronunciation
pairs are used to update lexicon 14 as well as
to supply new training data upon which the phoneticizer 24
may be retrained or updated. This can be accomplished by
using the word-pronunciation pairs as new training data 122
for generating revised decision trees in accordance with the
above-described method. Therefore, the self-learning
embodiment improves its phonetic transcription generation
over time, resulting in even higher quality transcriptions.

The foregoing discloses and describes merely exemplary
embodiments of the present invention. One skilled in the art
will readily recognize from such discussion, and from
accompanying drawings and claims, that various changes,
modifications, and variations can be made therein without
departing from the spirit and scope of the present
invention.

What is claimed is:

1. A system for developing word-pronunciation pairs
based on a spelled word input, comprising:
a transcription generator receptive of the spelled word
input for generating a phonetic transcription that corresponds
to the spelled word input, said phonetic transcription
being segmented into syllabic portions;
a phoneticizer receptive of the spelled word input for
producing a plurality of scored phonetic transcriptions,
where the phoneticizer employs letter-only decision
trees and phoneme-mixed decision trees to produce
said plurality of scored phonetic transcriptions, the
letter-only decision trees having nodes representing
questions about a given letter and neighboring letters in
the spelled word input and the phoneme-mixed decision
trees having nodes representing questions about a
phoneme and neighboring phonemes in the spelled
word input;
a phonemic editor connected to said transcription generator
for displaying and editing syllabic portions of said
phonetic transcription and connected to said phoneticizer
for displaying said plurality of scored phonetic
transcriptions; and
a storage mechanism for updating a lexicon with the
spelled word input and said phonetic transcription,
thereby developing the desired word-pronunciation
pair.

2. The system of claim 1 wherein said transcription
generator accesses a dictionary to generate said phonetic
transcription, the dictionary storing phonetic transcription
data corresponding to a plurality of spelled words.

3. The system of claim 1 wherein said transcription
generator using letter-to-sound rules to produce said phonetic
transcription.

4. The system of claim 1 wherein said phonetic transcription
further includes accentuation data for the spelled word
input.

5. The system of claim 4 wherein said dictionary storing
accentuation data corresponding to each of the plurality of
spelled words and said phonemic editor being operative to
display and edit the accentuation data associated with said
phonetic transcription.

6. The system of claim 1 wherein said phonemic editor
provides a language selection mechanism and said transcription
generator being connected to a plurality of dictionaries
each of which stores phonetic transcription data in a different
language, whereby said transcription generator invokes one
of said plurality of dictionaries to produce a phonetic
transcription that corresponds to the language from said
language selection mechanism.

7. The system of claim 1 further includes a pronunciation
selection mechanism connected to said phonemic editor for
selecting at least one of said plurality of scored phonetic
transcriptions.

8. The system of claim 7 wherein said pronunciation
selection mechanism provides at least one of said plurality
of scored phonetic transcriptions for updating said decision
trees.

9. The system of claim 1 wherein the spelled word input
and said phonetic transcription stored in said lexicon being
used to retrain said transcription generator.

10. The system of claim 1 further includes a speech
recognizer connected to said phonemic editor and receptive
of speech data corresponding to the spelled word input for
rescoring said plurality of scored phonetic transcriptions
based on said speech data.

11. The system of claim 1 further includes a speech
recognizer receptive of speech data corresponding to the
spelled word input and being operative to produce the
spelled word input, whereby said transcription generator
receptive of the spelled word input from said speech
recognizer.

12. The system of claim 1 further includes a speech
recognizer receptive of speech data for producing a sounds-like
word corresponding to the speech data, such that said
phonemic editor being operative to provide a sounds-like
phonetic transcription that corresponds to the sounds-like
word and replace at least one syllabic portion of said
phonetic transcription with said sounds-like phonetic
transcription.

13. The system of claim 1 further includes a text-to-speech
synthesizer connected to said phonemic editor and
receptive of said phonetic transcription for generating
speech data.

14. A system for developing word-pronunciation pairs
based on a spelled word input, comprising:
a dictionary for storing phonetic transcription data corresponding
to a plurality of spelled words;
a transcription generator connected to said dictionary and
receptive of the spelled word input for producing a
phonetic transcription that corresponds to the spelled
word input, said phonetic transcription being segmented
into syllabic portions;
a phoneticizer receptive of the spelled word input for
producing a plurality of scored phonetic transcriptions,
where the phoneticizer employs letter-only decision
trees and phoneme-mixed decision trees to produce
said plurality of scored phonetic transcriptions, the
letter-only decision trees having nodes representing
questions about a given letter and neighboring letters in
the spelled word input and the phoneme-mixed decision
trees having nodes representing questions about a
phoneme and neighboring phonemes in the spelled
word input; and
a phonemic editor connected to said transcription generator
for displaying and editing syllabic portions of said
phonetic transcription and connected to said phoneticizer
for displaying said plurality of scored phonetic
transcriptions, thereby developing the desired
word-pronunciation pair.

15. The system of claim 14 wherein said transcription 
generator being operative to produce said phonetic transcrip- 
tion using letter-to-sound rules. 

16. The system of claim 14 further includes a storage 
mechanism for updating a lexicon with the spelled word and 
said phonetic transcription. 

17. The system of claim 14 wherein said phonetic tran- 
scription further includes accentuation data for the spelled 
word input. 

18. The system of claim 17 wherein said dictionary 
storing accentuation data corresponding to each of the 
plurality of spelled words and said phonemic editor being 
operative to display and edit the accentuation data associated 
with said phonetic transcription. 

19. The system of claim 14 wherein the spelled word and 
said phonetic transcription being used to retrain said tran- 
scription generator. 

20. The system of claim 14 wherein said phonemic editor 
provides a language selection mechanism and said transcrip- 
tion generator being connected to a plurality of dictionaries 
each of which stores phonetic transcription data in a different 
language, whereby said transcription generator invokes one 
of said plurality of dictionaries to produce a phonetic 
transcription that corresponds to the language from said
language selection mechanism. 

21. The system of claim 14 further includes a pronunciation
selection mechanism connected to said phonemic editor
for selecting at least one of said plurality of scored phonetic
transcriptions.

22. The system of claim 21 wherein said pronunciation
selection mechanism provides at least one of said plurality
of scored phonetic transcriptions for updating said decision
trees.

23. The system of claim 14 further includes a speech
recognizer connected to said phonemic editor and receptive
of speech data corresponding to the spelled word input for
rescoring said plurality of scored phonetic transcriptions
based on said speech data.

24. The system of claim 14 further includes a speech
recognizer receptive of speech data corresponding to the
spelled word input and being operative to produce the
spelled word input, whereby said transcription generator
receptive of the spelled word input of said speech
recognizer.

25. The system of claim 14 further includes a speech
recognizer receptive of speech data for producing a sounds-like
word corresponding to the speech data, such that said
phonemic editor being operative to provide a sounds-like
phonetic transcription that corresponds to the sounds-like
word and replace at least one syllabic portion of said
phonetic transcription with said sounds-like phonetic
transcription.

26. The system of claim 14 further includes a text-to-speech
synthesizer connected to said phonemic editor and
receptive of said phonetic transcription for generating
speech data.

* * * * * 



